\documentclass[12pt,a4paper]{article}

\usepackage{amsmath, amsthm, amssymb, tikz, realboxes, bibentry, natbib, url, a4wide, graphicx, verbatim, setspace} 
\usepackage[affil-it]{authblk}
\usepackage{listings}
\usepackage{graphicx,rotating,booktabs}
\usepackage[verbose]{placeins}
\usepackage{blindtext, float, varwidth}



\begin{document}
\author{Philipp Hunziker\thanks{Email: hunziker@icr.gess.ethz.ch}}

\affil{Center for Comparative and International Studies\\ ETH Z\"{u}rich}
\title{Does Petroleum Extraction Promote Ethnic Mobilization?\thanks{Prepared for the ISA Annual Convention in Toronto, 26 -- 29 March 2014.}}

\date{\today\\ \vspace{10 mm} \small{Work in progress. Please do not cite without permission.}}
\maketitle

\abstract{This paper analyzes whether individuals in petroleum-producing areas are more likely to mobilize along ethnic cleavages. 
I expect that ethnic mobilization is more common in petroleum-abundant areas because petroleum provides rent-seeking opportunities, and because the environmental and social costs of industrial petroleum production create incentives for local communities to confront the state as a unitary actor.
I use a new spatial matching procedure to test whether petroleum-producing regions in Sub-Saharan Africa and Asia systematically feature a greater number of politically relevant ethnic groups than comparable, petroleum-free areas. 
Preliminary evidence suggests that there does seem to be a robust association between petroleum production and the spatial distribution of politically relevant ethnic groups.}

\newpage
\tableofcontents
\newpage
\onehalfspacing

\section{Introduction}
\label{Sec:1}

The Ogoni are an ethnolinguistic group native to Nigeria's Niger Delta. With approximately 750,000 members in a country of over 150 million, the Ogoni constitute a minuscule minority. Their vanishingly small demographic weight within Nigerian society is emphasized by the composition of the country's dominant ethnic identities. Nigeria's post-colonial history has in large part been determined by the struggle between the Hausa-Fulani, Yoruba, and Igbo groups, comprising approximately 30\%, 20\% and 20\% of the population, respectively, over control of the central state. Against these odds, by the early 1990s, Ogoni identity had become a highly salient basis for political mobilization, with several political organizations making explicit ethnic claims and receiving national and international attention \citep{HRW1995}. Though Ogoni mobilization has its origins in the 1940s, mass mobilization peaked in 1993, against the backdrop of Nigeria's gradual transition away from military dictatorship. Regular mass protests were held, calling for increased autonomy in local governance, and even an independent Ogoni state. The history of Ogoni mobilization is intricately interwoven with the Niger Delta's petroleum industry. Ogoni calls for autonomy have centered almost exclusively on claims related to oil production. In particular, local leaders repeatedly addressed the allegedly small share of oil revenue channeled back to local communities by the central state,\footnote{However, a policy granting oil-producing communities preferential access to oil revenue had apparently been in effect even before these vocal mobilization efforts \citep[p. 332]{Osaghae1995}.} and the devastating environmental and social burdens caused by the extraction process \citep{Osaghae1995}.

In the case of the Ogoni, it is difficult to escape the conclusion that petroleum extraction has been a key driving force underlying political mobilization on the basis of the Ogoni identity. Moreover, a brief review of the qualitative literature suggests that the Ogoni are not an exception. Claims that large-scale resource extraction has sparked or facilitated ethnic mobilization have been made in a number of cases, such as the Subanon in the Philippines \citep{Rovillos2012}, the Cabindans in Angola \citep{Porto2003}, or the Acehnese in Indonesia \citep{aspinall2007construction}. Based solely on case studies, however, it is impossible to determine whether petroleum extraction is generally associated with a higher likelihood of ethnic mobilization, or whether the two phenomena are essentially unrelated, and the aforementioned cases have simply received disproportionate attention due to the involvement of a high-value resource. The goal of the present paper is to investigate whether the link between petroleum production and ethnic mobilization represents a generalizable causal effect, and if so, to what degree. Specifically, I address the question of \emph{whether petroleum extraction promotes political mobilization based on ethnic identities.}

Addressing this question is worthwhile for at least three reasons. First, gaining a better, more systematic understanding of the political consequences of petroleum extraction is a key requirement for anticipating likely future scenarios in regions facing resource windfalls. Should we, for instance, expect newly discovered oil and gas reserves in East Africa (specifically, Kenya, Uganda, Tanzania and Mozambique, \citealt{Economist2012}) to have an impact on the (sometimes delicate) ethnopolitical equilibria in the respective countries? Second, answering the proposed research question should improve our understanding of the determinants and mechanisms underlying the emergence of politically salient ethnic identities on a more fundamental level. In particular, providing evidence for a systematic link between petroleum extraction and politically relevant ethnicity would speak in favor of instrumentalist theories of ethnic salience. Given the immense wealth associated with large-scale petroleum extraction, it is difficult to devise an explanation of the proposed phenomenon that does not involve materialistic motivations as a contributor to ethnic mobilization. Finally, investigating the link between petroleum and ethnic mobilization may have implications beyond the study of ethnic politics. Specifically, whether petroleum extraction affects the political salience of ethnic identities may be relevant to the literature addressing the various ``resource curses''. Natural resource abundance, and oil and gas in particular, have been linked to a variety of adverse social outcomes, such as intrastate conflict (e.g., \citealt{Ross2006}, \citealt{Lujala2010}), persistent non-democratic regimes (e.g., \citealt{Ross2001}), and slow economic growth (e.g., \citealt{Sachs2001}). Finding that petroleum affects the emergence of politically relevant ethnicity might provide important theoretical inputs to these research agendas. 
Perhaps, indeed, some of the negative effects attributed to petroleum abundance run through the creation of ethnic cleavages. Moreover, a statistical link between petroleum and ethnic salience would suggest that ``ethnicity'' should not be interpreted as an exogenous factor when analyzing the policy implications of petroleum wealth and extraction. In particular, this would suggest that testing resource-related hypotheses while ``controlling'' for ethnicity as a competing explanation for the outcome under scrutiny, as, for instance, practiced by \citet{Collier1998} in their analyses of civil war, is inappropriate. 

The remainder of this paper is structured as follows. The next section provides a brief overview of the existing literature on the empirical relevance, and the emergence, of politically salient ethnic identities. Section \ref{Sec:3} introduces a theoretical framework of the possible mechanisms from petroleum extraction to ethnic mobilization, and derives empirically testable implications. Next, section \ref{Sec:4} introduces the spatial research design at the core of this paper. Finally, results are discussed in section \ref{Sec:5}. 


\section{Literature Review}
\label{Sec:2}

In recent years, a growing body of empirical research has explicitly addressed the idea that political salience is a key component to understanding the link between ethnicity and a wide range of policy outcomes. Specifically, it is increasingly accepted that for understanding the role of ethnicity in shaping politics and policy, it is of crucial importance to distinguish between what one may call the ethnic ``source material'' of a country, that is, the pool of religious, linguistic and phenotypical categories that may potentially serve as the basis for ethnic identification, and those ethnic identities that serve as salient cleavages in the political process. In this sense, ethnic identities become salient if they are successfully used by elites for the purpose of political mobilization, i.e., when elites obtain support by claiming to represent, and act for the benefit of, an ethnically defined populace.

The idea that ethnic identities are context-dependent is not particularly new. Indeed, constructivists have long argued that framing ethnicity as a demographic constant is misleading (see \citealt{Fearon2000} for an overview). Rather, the constructivist argument goes, ethnicity is best understood as the result of a social process; that is, ethnicity does not naturally group individuals into internally coherent and externally incompatible categories, but is only as meaningful as humans make it through discourse and actions that reiterate the idea of otherness \citep[p. 848]{Fearon2000}. Consequently, depending on the time of analysis, or even the issue under scrutiny, individuals belonging to two social categories may consider ethnic distinctions highly relevant and associated with strong expectations about others' political preferences and actions, or completely meaningless. Once we accept this premise, it seems trivial to conclude that any analysis of the effects of ethnic diversity on policy outcomes should attempt to distinguish between meaningful ethnic cleavages and purely anthropological social categories \citep{Laitin2001}. However, partly because collecting data on salient ethnic identities is cumbersome, and partly because of disciplinary boundaries, the constructivist insight that salience should be accounted for has only recently found its way into quantitative analyses of the role of ethnic diversity for policy outcomes. 

In accordance with constructivist expectations, those studies that do attempt to distinguish between politically meaningful ethnic categories and purely demographic diversity find that whether a particular identity is mobilized is highly important for explaining policy outcomes. \citet{Posner2004}, for instance, criticizes the widespread use of the ELF (Ethno-Linguistic Fractionalization) index in cross-country growth regressions (see, e.g., Alesina and La Ferrara 2004), and finds that a time-variant ethnic fractionalization index that only incorporates politically salient ethnic groups outperforms the ELF in explaining macroeconomic policy and long-run growth rates in Africa. Similarly, \citet{Cederman2007} and \citet{Cederman2010} criticize the use of the ELF index in studies addressing the onset of violent intrastate conflict. They argue that previous claims by \citet{Collier1998} and \citet{FearonLaitin2003} on the apparent irrelevance of ethnic identities for explaining the outbreak of civil war may have been premature. Rather, \citet{Cederman2010} argue that the latter authors' non-findings are related to their use of the ELF index, which fails to capture those ethnic cleavages that are relevant for explaining the outbreak of ethnic conflict. Indeed, using a data set that explicitly identifies politically relevant ethnic groups and their access to state power,\footnote{Incidentally, this is also the data set underlying the empirical analysis in the present paper.} \citet{Cederman2010} find that there is a strong relationship between ethnopolitical exclusion and civil conflict.

\subsection{Explanations of Ethnic Salience}

Naturally, if it is the case that whether an identity is being used for political mobilization mediates the link between ethnicity and policy outcomes, this raises the question of why and when social identities become politically relevant in the first place. The goal of the present paper is to contribute to answering this question by investigating the role of a very specific phenomenon -- petroleum extraction --  in promoting ethnic mobilization. 

Although, to my knowledge, the present paper constitutes the first attempt to analyze the effect of petroleum on ethnic salience systematically, it builds on an extensive body of literature trying to explain the emergence of politically relevant ethnic identities more generally. Following \citet{Fearon2000}, it is helpful to group this literature into three more or less distinct theoretical frameworks. First, one prominent approach that addresses the roots of political identity formation consists of the macrohistorical accounts of the emergence of nationalism by \citet{Deutsch1953}, \citet{Gellner1983} and \citet{Anderson2006}. These authors focus explicitly on nationalist identities (in contrast to ethnic identities, which need not be nationalist), and argue that nationalism has emerged as the product of long-term social processes, such as the rise of the modern state system, economic modernization, and mass communication. These processes are argued to have made nationalist identification appealing for emergent mass publics. 

A second strand of literature that tries to understand the emergence of social identities focuses on the role of discourses in shaping individuals' perceptions of group membership. Here, it is argued that ``individuals are pawns or products of discourses that exist and move independently of the actions of any particular individual'' \citep[p. 851]{Fearon2000}.  

The third approach, which has gathered increased attention in recent years and is most immediately relevant for the research question at hand, comprises instrumentalist explanations of ethnic salience. These theories hold that politically relevant ethnic cleavages emerge because ethnic mobilization is beneficial for some, or all, group members. In contrast to the macrohistorical approach in the classic literature on nationalism, these theories look more specifically at the issue of the emergence of multiple ethnic identities within the same polity (in contrast to a unifying nationalist identity), and operate on a much shorter time frame. Here, the incentives for identifying as part of a particular ethnic group in the political process are located at the level of more or less immediate political payoffs, rather than arising from sweeping social transformations. In contrast to the discourse-theoretic approach, instrumentalist theories postulate a clear, individualistic means-ends calculation at the center of the identity formation process, rather than locating agency at the supra-individual level. One prominent body of literature within the instrumentalist framework attempts to explain the emergence of politically salient ethnic identities by focusing on political elites who incite inter-ethnic violence for their own political survival (\citealt{DeFigueiredo2000}, \citealt{Fearon2000}). Other, more recent publications locate incentives for ethnic mobilization not at the elite level, but at the level of mass publics, and focus less narrowly on episodes characterized by political violence. These approaches model ethnic salience as the product of a process that resembles a strategic choice scenario: Opting for political mobilization along a certain ethnic identity occurs because it is beneficial for its members, especially in the context of gaining access to state-funded public goods or achieving otherwise favorable policy outcomes. 
Different authors have proposed various factors allegedly determining when individuals choose to mobilize along a certain ethnic cleavage. \citet{Posner2007}, for instance, highlights how political institutions set incentives for organizing along certain ethnic dimensions. \citet{Esteban2008} argue that individuals choose to compete over government resources in ethnic coalitions if there is substantial intra-group economic inequality. 

Most relevant for the research question at hand, however, are the contributions by \cite{Fearon1999} and \cite{caselli2013theory}, henceforth collectively referred to as ``FCC''. These authors argue that individuals choose to adopt ethnic identities in order to form effective minimal winning coalitions in the contest over government resources. Similar to other instrumentalist theories of ethnic salience, these authors assume that individuals organize along ethnic lines in order to gain access to government resources that can be selectively distributed to members of the group. FCC's key contribution, however, is that they provide an explanation for why individuals choose ethnicity, as opposed to other group-membership criteria, as the basis for political coalitions. The reason individuals organize along ethnic identities, rather than any other possible category (class, region, issue-specific interest groups), according to these authors, is that ethnic groups serve as effectively enforceable minimal winning coalitions. The term ``minimal winning'' refers to the assumption that in the struggle over state resources, individuals face incentives to form coalitions that are large enough to beat other contestant groups, but only by a minimal margin, so as to maximize the per-capita value of the resulting spoils. In this context, ethnicity is argued to be a particularly effective criterion for defining group membership, because it prevents outsiders from joining a coalition once it has gained access to state resources. Since social markers that serve as the basis of ethnic identities, such as language, religion and phenotype, are often easily visible and difficult to change, coalition membership can be enforced even after the inter-group contest over the rewards. If group membership is not enforceable, rewards will be diluted, and, in anticipation of this effect, the initial incentive for mobilizing minimal winning coalitions will be much weaker.  
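
The size logic of this argument can be sketched in a stylized form. The notation below is introduced here purely for illustration and is not taken from FCC; it assumes a fixed prize, equal sharing, and a simple size threshold for winning. Let $R$ denote the total value of the contested state resources, $n$ the number of coalition members, and $\bar{n}$ the smallest size at which a coalition prevails over its competitors. Each member of a winning coalition then receives
\begin{equation*}
\pi(n) = \frac{R}{n}, \qquad n \geq \bar{n},
\end{equation*}
which is strictly decreasing in $n$, so the payoff-maximizing winning coalition has size $n^{*} = \bar{n}$. If membership cannot be enforced and $m > 0$ outsiders join after the contest, the per-capita payoff falls to
\begin{equation*}
\pi(\bar{n} + m) = \frac{R}{\bar{n} + m} < \frac{R}{\bar{n}},
\end{equation*}
which is the dilution that, under these assumptions, an enforceable membership criterion such as ethnicity prevents.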

From this basic model of ethnic salience, FCC derive a number of empirical predictions with regard to where we should observe politically relevant ethnicity, and which social categories should emerge as salient ethnic identities. One noteworthy prediction in both papers is that particularly rigid and easily detectable social markers, such as phenotype and language, should be associated with salient ethnic cleavages more often than more porous and inconspicuous attributes, such as religious denomination. More importantly, however, both articles relate the emergence of politically relevant ethnicity to the availability of government resources that can be appropriated by ethnic coalitions and distributed to their members. These resources, which Fearon calls ``pork'' and Caselli and Coleman refer to as ``expropriable assets'', have in common that the government manages their allocation, they are excludable (otherwise coalition formation would be unnecessary), and they are rivalrous in consumption (thus creating incentives for minimal winning coalitions). Examples include lump-sum handouts or civil service positions. This prediction is of particular relevance for the research question at hand because it has very straightforward implications with regard to the effect of large-scale natural resource endowments on identity formation. Resource windfalls clearly satisfy all of the above-mentioned criteria, and should thus provide major incentives for ethnic mobilization. In fact, \cite{caselli2013theory} explicitly mention mineral resource revenue as an incentive for ethnic coalition building.  

\subsection{The Role of Petroleum}

Despite the fact that the instrumentalist framework provided by FCC would imply a very straightforward connection between natural resource wealth and politically salient ethnic identities, the relationship has so far not been tested systematically. However, as already alluded to in the introduction, there are good reasons to do so. First, showing that petroleum affects the emergence of politically relevant ethnic groups would lend considerable support to the instrumentalist approach in general, and Fearon (1999) and Caselli and Coleman's (2013) theory in particular. Though, as I will argue in the next section, the latter authors' theory is certainly not the only possible explanation for a link between petroleum extraction and ethnic salience, finding such evidence would strongly speak in favor of the instrumentalist assumption that ethnic salience is the outcome of relatively short-term, individual-level means-ends reasoning. Second, demonstrating an empirical link between petroleum extraction and ethnic salience would add further evidence to the constructivist understanding of ethnicity, and underscore Laitin and Posner's (2001) argument that empirical researchers should not frame ethnicity as a demographic constant, but as a context-dependent and evolving social construction. Finally, finding that petroleum affects the emergence of politically relevant ethnic identities may have major implications for the various research agendas relating petroleum extraction to adverse political outcomes, such as slow economic growth (e.g., \citealt{Sachs2001}), failure to democratize (e.g., \citealt{Ross2001}, \citealt{Smith2004}, \citealt{Jensen2004}), or violent intrastate conflict (e.g., \citealt{Humphreys2005}, \citealt{Ross2006}, \citealt{Lujala2010}). 
Showing that petroleum affects ethnic salience would have important theoretical implications for these literatures, in that it would suggest that the various ``resource curses'' may run through the creation of horizontal ethnic cleavages. In the case of regime type, for instance, it may be the case that petroleum impedes democratization because it allows leaders to stay in power by effectively paying off relatively small, well-defined ethnic coalitions. Similarly, the link between petroleum and violent conflict may be caused by the emergence of previously politically irrelevant ethnic communities that contest government authority over the allocation of resource rents (similar arguments have been made by \citealt{Ross2006}, \citealt[ch. 5]{Ross2012}). Moreover, if it is the case that petroleum extraction affects the creation of politically salient ethnic identities, this may have important implications for how we should analyze the various ``resource curses'' empirically. For instance, if the effect of petroleum on the outbreak of civil conflict runs through the emergence of salient ethnic identities, adding measurements of ethnic fractionalization and resource wealth as linear predictors into a conflict-onset regression to see which explanation receives more evidence, as is practiced by Collier and Hoeffler (1998), will generate misleading results, since we are controlling for a factor that lies on the causal path from petroleum to conflict. Similarly, if petroleum affects the emergence of politically relevant ethnic groups, analyses attempting to estimate the effect of petroleum on the outbreak of ethnic conflict that use politically relevant ethnic groups as the unit of analysis (such as \citealt{Sorens2011}) will suffer from selection problems.
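
The concern about conditioning on ethnicity can be illustrated with a deliberately simple linear sketch. The symbols and the linear functional form are assumptions made here for exposition only; they do not describe a model estimated in this paper or in the cited studies. Let $P$ denote petroleum wealth, $E$ ethnic salience, and $Y$ conflict onset, and suppose
\begin{align*}
E &= \alpha P + u, \\
Y &= \beta E + \gamma P + \varepsilon,
\end{align*}
where $u$ and $\varepsilon$ are mean-zero errors that are uncorrelated with $P$ and with each other. Substituting the first equation into the second gives
\begin{equation*}
Y = (\gamma + \alpha\beta) P + \beta u + \varepsilon,
\end{equation*}
so the total effect of petroleum on conflict is $\gamma + \alpha\beta$. A regression of $Y$ on both $P$ and $E$, however, recovers only $\gamma$ as the coefficient on $P$. In the extreme case $\gamma = 0$, petroleum would appear irrelevant even though it changes conflict risk by $\alpha\beta$ through ethnic salience.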


\section{Theoretical Framework}
\label{Sec:3}

In this section I present a theoretical framework with the goal of explaining why we should expect the emergence of politically salient ethnic identities in the presence of large-scale petroleum extraction. Far from introducing a truly novel model of identity formation, I will adopt the core premises of the instrumentalist theories of ethnic salience, and build substantially on the work of Fearon (1999) and \cite{caselli2013theory}. However, I will attempt to complement the latter authors' arguments, which would imply that the role of natural resources in creating and modifying ethnic identities runs primarily through monetary incentives, with an alternative explanation, which locates the identity-establishing effects of petroleum in its adverse impacts on local communities.

To make the exposition more accessible, I divide it into three successive parts.
\begin{itemize}
\item First, I argue that petroleum extraction creates incentives for individuals living near production sites to organize politically, thus creating horizontal and geographically delimited political cleavages.
\item Second, I argue that there are reasons to expect that this process will lead to relatively small political coalitions, that is, groups that contain only a relatively small fraction of a country's population.
\item Finally, I will present arguments that suggest that such coalitions will often be defined via ethnic markers, rather than any other social delimiter.
\end{itemize}

\subsection{Horizontal Cleavages}

I argue that proximity to petroleum extraction sites creates incentives for political mobilization among the local population. Specifically, I present two explanations for why this is the case.

First, building on Fearon (1999) and Caselli and Coleman's (2013) framework, one may argue that petroleum extraction sets incentives to build horizontal cleavages for the purpose of acquiring resource rents, and distributing them among group members. However, FCC's argument does not yet entail a geographical component -- why should individuals in proximity to resource extraction sites face incentives to organize into a common coalition, rather than to join any other potential grouping of individuals in a country? I argue that this is the case because individuals anticipate that a group consisting of citizens living in proximity to a petroleum extraction site will be in a privileged bargaining position vis-\`a-vis other coalitions in the struggle over access to government revenue. Groups inhabiting the area surrounding resource extraction sites will be able to gain access to a greater share of resource windfalls because they have substantial bargaining leverage -- they may threaten to use institutional or extra-institutional means to impede the state's ability to extract petroleum, and thus shrink the total size of the pie available for distribution. Such measures may range from purely legal challenges, to organized protests, to the ultimate threat of attempting secession and thus cutting off the rest of the country from resource spoils entirely.

A second argument for why individuals in proximity to petroleum extraction sites face incentives to mobilize is that the extraction process itself creates common policy preferences, which are most effectively pursued by pooling one's resources. It is widely known that, especially in developing countries, petroleum extraction is often accompanied by a host of negative externalities for local communities. Without well-established local governance structures and an effective legal system, the establishment of an extractive industry is very likely to produce substantial costs for the surrounding population. For instance, in Indonesia's Aceh province, the combination of rapid industrialization with few or no governance institutions in place has led to land expropriation, catastrophic pollution, and massive in-migration \citep[p. 35]{Kell2010}. The same is documented for many other cases of natural resource extraction, for instance in the Niger Delta \citep{Watts2004}, Sierra Leone \citep{Richards1996} and Ghana \citep[p. 12]{Switzer2001}. These externalities will likely shift local individuals' political demands towards the management of these issues, thus aligning the policy preferences of individuals in petroleum-producing areas, and creating the basis for effective political mobilization.

It is worth noting that these two arguments for why we should see political mobilization in petroleum-producing areas have appeared in similar form in the literature that attempts to explain the link between natural resource production and violent conflict. There, arguments of the first type, which relate political mobilization to ``prize-grabbing'' incentives, and arguments of the second type, which highlight the negative externalities of the extraction process, are often framed as competing mechanisms for explaining the statistical relationship between resource extraction and violent civil conflict (see, e.g., \citealt{Ross2004a}, \citealt{Humphreys2005}). However, there is little reason to believe that these are mutually exclusive. In fact, they appear to be fairly complementary: individuals in resource-producing regions may organize for the purpose of appropriating a greater share of resource rents precisely because they feel entitled to be compensated for the adverse effects of petroleum production. 

\subsection{Small Coalitions}

I argue that resource extraction will not only set incentives for the emergence of spatially concentrated political coalitions, but that these coalitions will typically be relatively small in comparison to the overall population of the country. Again, I propose two mechanisms underlying this claim. The first builds on the argument made above that political mobilization in resource-rich areas occurs for the purpose of appropriating resource rents for in-group distribution. As argued by FCC, because such rents are rivalrous in consumption, there are strong incentives to form minimal winning coalitions, so as to maximize expected per-capita payoffs. However, as highlighted by FCC, these incentives will affect any coalition-building effort with the goal of appropriating political ``pork'', not just coalitions in resource-producing areas. So why would we expect the latter to be particularly small? The answer is, again, bargaining leverage. Having direct access to petroleum extraction sites eases the requirement to create larger coalitions with more members for the purpose of gaining leverage vis-\`a-vis other contestant coalitions. Put differently, proximity to extraction sites increases a group's per-capita leverage; a well-organized group sitting on top of its country's main source of income can gain a lot of influence on the national scene. Hence, while incentives to appropriate resource revenue by forming a particularly small political coalition are universal, only groups in the producing areas have the necessary political leverage to actually do so. 
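
One stylized way to express the leverage argument is the following, where all notation is assumed here for illustration only and is not part of FCC's models. Suppose each member of a coalition contributes bargaining leverage $\lambda > 0$, and a coalition secures access to resource rents once its total leverage reaches a threshold $T > 0$, that is, once $\lambda n \geq T$. Treating coalition size $n$ as continuous, the smallest winning coalition has size
\begin{equation*}
n^{*}(\lambda) = \frac{T}{\lambda},
\end{equation*}
which is decreasing in $\lambda$. If groups near extraction sites have per-capita leverage $\lambda_{P}$ and all other groups have $\lambda_{0} < \lambda_{P}$, then $n^{*}(\lambda_{P}) < n^{*}(\lambda_{0})$. Under these assumptions, groups in producing areas can prevail with fewer members than groups elsewhere.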

The second mechanism I propose to explain coalition size in petroleum-rich areas builds on the argument that the negative externalities associated with resource extraction create corresponding political demands only on a very local scale. Resource extraction often entails consequences so severe that it is straightforward to expect that their management will become the dominant policy issue in local communities, and will generate significant potential for mobilization. However, because these externalities are spatially confined to extraction sites, the absolute number of individuals affected by them, and willing to mobilize along this issue, will be relatively minor. In fact, because imposing stricter regulation on petroleum extractors will only benefit local communities, yet impose costs on the entire country (even if only in the form of opportunity costs), we may expect that individuals living outside resource-extracting areas will have little interest in introducing such measures. Hence, because the adverse effects of resource extraction are spatially concentrated, yet represent a highly salient policy issue for the affected communities, we would expect to see the emergence of relatively small political coalitions in resource-rich areas that mobilize around demands for stricter regulation of the petroleum industry.

\subsection{Ethnic Mobilization}

Even if, as postulated above, large-scale petroleum extraction creates incentives for mobilization into geographically concentrated political coalitions, this does not imply that the latter need to be defined in terms of ethnic identities. Alternatively, one could imagine mobilization on the basis of purely geographical delimiters (residents of a valley, or a river delta), or administrative or federal units. 
I identify two explanations for why we should expect ethnicity to be a particularly effective basis for coalition building in this context.

First, there is of course FCC's key argument that an ethnicity-based membership criterion is an attractive choice because it makes group membership more easily enforceable. Unlike a criterion that relies purely on place of residence, ethnicity provides social markers that are often easily detectable, and, more importantly, difficult to change for individuals. Hence, if coalitions from petroleum-rich areas are successful in acquiring a significant share of resource rents to distribute among their members, an ethnic membership criterion prevents outsiders from joining the group ex-post, and diluting per-capita payoffs. In fact, the fear that outsiders may want to benefit from in-group spoils may be especially pertinent in resource-producing communities, since these often experience substantial in-migration. In the absence of permanent markers that easily distinguish insiders from newcomers, members of resource producing communities may feel that their efforts to gain access to resource windfalls unjustly benefit newly arriving immigrants. It is interesting to note that this argument does not hold if mobilization in resource-rich areas takes place primarily because locals want to eliminate the costs associated with large-scale petroleum extraction. Policies that combat the latter, such as environmental regulation and due process rules for land expropriations, are not political ``pork'' because they are not rivalrous in consumption. Consequently, the value of these policies to local communities does not diminish in the number of beneficiaries, and hence there is no need to enforce group-membership once they are adopted. 

However, there is a further explanation for why individuals in resource-rich regions would choose to mobilize along ethnic lines. If an ethnic group is based on a common language (or, less so, a common religious denomination), it is likely that its members' social networks will consist mainly of other group members. These pre-existing social networks, as well as the possibility of creating a public discourse that addresses group members exclusively, will make mobilization substantially less costly \citep{Bates1983}. Discussing this argument, Fearon (1999) also adds that repeated in-group interaction may foster trust among coethnics, which further facilitates mobilization \citep{Fearon1996}. Hence, resource-rich communities may organize along ethnic lines simply because it is a particularly effective mobilization strategy. 

\subsection{Testable Implications}

In the preceding section, I have established that petroleum extraction should provide incentives for political mobilization of relatively small ethnic groups that are spatially concentrated in petroleum producing areas.
The goal of this section is to translate these theoretical expectations into empirically testable implications. 

Formulating an explicit hypothesis that allows testing the above-made arguments in a generalizable manner is complicated by the fact that ethnopolitical landscapes differ considerably across and within countries, even in the absence of high-value resources. Let us consider two counterfactual scenarios to illustrate the challenges involved.
First, consider the following scenario: We observe a petroleum-free region within some country where, for reasons beyond the scope of this paper, we do not observe politically relevant territorially delimited ethnic groups. Though individuals may identify with various subnational ethnic groups, perhaps along linguistic divides, the latter do not constitute politically salient categories. Now imagine the counterfactual case where sizable oil reserves are located in the region under consideration. According to the theoretical framework discussed above, the latter situation should create significant incentives for individuals located near the petroleum extraction site to mobilize along some common ethnic delimiter, such as a common language, and issue political demands based on their ethnic identity. This counterfactual comparison is illustrated in figure \ref{Fig:Hyp} and labeled \emph{scenario A}. The circular extract depicts a hypothetical arbitrary region within a country, the red dot represents a productive petroleum field, and the green area portrays the settlement area of some politically relevant ethnic group.

\begin{figure}[H]
	\centering
	\includegraphics[scale=0.5]{Plots/hypothesis.pdf}
	\caption{Counterfactual case comparisons for different local ethnopolitical constellations. }
	\label{Fig:Hyp}
\end{figure}

Next, consider the scenario of a petroleum-free region where individuals identify with two relatively large politically relevant ethnic groups. Again, we picture the counterfactual case where there are substantial petroleum reserves in the center of the given area. According to our theoretical framework, we might expect that petroleum production sets incentives for individuals to mobilize along a smaller ethnic identity than they would otherwise, composed of individuals living exclusively in the vicinity of the extraction site. In practice, one could imagine that individuals identify and mobilize on the basis of a common local dialect in the petroleum-abundant case, whereas mobilization and self-identification would focus on the larger language family in the absence of the incentives generated by high-value resources. This counterfactual comparison is labeled \emph{scenario B} in figure \ref{Fig:Hyp}, where the dark shaded areas represent the two larger politically relevant ethnic groups.

In order to pursue a comprehensive test of the proposed theoretical framework, it is desirable to derive a hypothesis that mirrors the effects expected in either scenario, since both conform to the logic proposed in the theoretical discussion.
However, doing so is not entirely trivial.
Scenario A may suggest testing the simple hypothesis that petroleum-rich regions should be more likely to feature at least one politically relevant territorial ethnic group in comparison to petroleum-free regions. However, this hypothesis constitutes only an incomplete test of the theory, since effects of the type depicted in scenario B would be ignored. In other words, the effect of petroleum production on ethnicity, even if present, would go unnoticed in regions where salient territorially delimited ethnicity is prevalent even in the absence of petroleum production.
Alternatively, one might consider testing the hypothesis that the average politically relevant ethnic group in petroleum-rich regions should be smaller (in demographic terms) than the average group in petroleum-free regions. Clearly, this test would be able to pick up the type of effect depicted in scenario B, where we expect a fractionalization of otherwise large groups into smaller ones. Unfortunately, this formulation induces a selection problem. Group size is only observable where we observe a politically relevant ethnic group in the first place -- consequently, the analysis would be restricted to a sample of areas featuring at least one politically relevant ethnic group. As is well known in the econometric literature, this type of non-random selection may induce substantive bias in statistical estimates.
As a third option that avoids these issues, I propose the following hypothesis: 
\begin{center}
\parbox{0.8\textwidth}{
	\emph{Petroleum producing areas should exhibit a larger number of politically relevant and territorially delimited ethnic groups than comparable, petroleum-free regions.}
}
\end{center}
This test of the theory is appealing because it applies to both scenarios considered, and its response variable is observable throughout the sample.
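
In notation introduced here purely for illustration (it is not used elsewhere in the paper), let $Y_i$ denote the number of politically relevant, territorially delimited ethnic groups in region $i$, let $T_i \in \{0,1\}$ indicate petroleum production in that region, and let $X_i$ collect the characteristics on which regions are deemed comparable. Under these assumptions, the hypothesis amounts to the expectation that
\begin{equation*}
\mathrm{E}[Y_i \mid T_i = 1, X_i = x] > \mathrm{E}[Y_i \mid T_i = 0, X_i = x].
\end{equation*}
Because $Y_i$ is a count that is defined (possibly as zero) for every region, this comparison does not condition on the presence of a group, which is how it sidesteps the selection problem noted above.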


\section{Research Design and Data}
\label{Sec:4}

To test the proposed hypothesis, I adopt a quantitative cross-sectional design based on randomly sampled, arbitrarily defined spatial units. Specifically, the underlying idea is to compare spatially defined areas featuring petroleum extraction (treated units) with similar areas of equal size without petroleum production (control units), and test whether we can find systematic differences in ethnic mobilization. Arbitrarily defined areal units of analysis allow avoiding the potential endogeneity issues arising from the use of ``natural'' units of analysis, such as ethnic settlement areas or administrative units, which may themselves be shaped by large-scale petroleum production.

Because petroleum production may not be randomly distributed across the globe, I apply a novel geographic matching procedure for sampling the areal treatment and control units. In particular, I match on a range of geographic and demographic control variables which may affect both petroleum production and ethnic mobilization and exclusion. Further, I adopt a fixed-effects-like design and match on country dummies in order to eliminate country-level confounders.  

Finally, I analyze the matched data set with spatial econometric models. Specifically, since the sampled areal units may feature spatial interdependence, I use spatial models to test whether petroleum extraction affects the presence of politically relevant ethnic groups.
In the remainder of this section, I will first introduce the data and case selection, and then proceed to discuss the geographic matching design and the econometric models used. 

\subsection{Data}

\paragraph{Outcome Variable} \hspace{0pt} \\
I use the EPR and GeoEPR data sets in order to identify whether a given area is inhabited by members of a politically relevant ethnic group. The EPR (Ethnic Power Relations) data set is an effort to identify all politically relevant ethnic groups and their access to state power, across the globe, for the period between 1946 and 2009 \citep{Cederman2010}. Groups are considered politically relevant ``if at least one significant political actor claims to represent the interests of that group in the national political arena, or if members of an ethnic category are systematically and intentionally discriminated against in the domain of public politics'' \citep[p. 2]{Vogt2011}. The GeoEPR data set is a spatial extension to EPR and provides information on ethnic groups' settlement type and territory within a country \citep{Wucherpfennig2011}. In particular, for ethnic groups with territorially delimited settlement areas, which are of particular interest for the present analysis, GeoEPR provides geo-coded polygons identifying their settlement area.\footnote{GeoEPR also codes groups that live exclusively in cities, have seasonal migration patterns, or are dispersed across an entire country's territory. These groups are not included in the present analysis.} I map the GeoEPR polygons for groups with spatially delimited settlement areas onto the areal units of analysis (described below) in order to identify whether a given region is inhabited by members of a politically relevant ethnic group. Figure \ref{Fig:epr} illustrates the mapped GeoEPR groups. 

In principle, EPR and GeoEPR are time-variant data sets covering the entire period between 1946 and 2009. However, I opt for a cross-sectional design, and construct a time-invariant response variable that uses data from a specific year. The reasoning behind this decision is twofold. First, there is substantial uncertainty in the intertemporal component of the available data, which limits the value of employing time-series analysis. This is true in particular with regard to the PETRODATA data set \citep{Lujala2007}, from which I take data on the location of petroleum fields. Although PETRODATA is clearly an invaluable asset and represents the most comprehensive freely accessible data source on petroleum fields, its coding of fields' first production dates features a substantial fraction of missing values.  Of the total 892 onshore petroleum field polygons identified in PETRODATA, first production dates are missing for 475 observations. Hence, even if we adopted a panel framework for testing the proposed hypotheses, there would not be much temporal variance in the main independent variable. The same caveat also applies, in lesser form, to the data from EPR and GeoEPR. Though the latter are time-variant, the exact dates assigned to the emergence of a salient ethnic identity must be interpreted with caution. Though the EPR coders have surely undertaken substantial efforts to make their codings as accurate as possible, the task of assigning a single year to the often long-winded process of the emergence of politically salient ethnic identities remains challenging.  Second, moving from a cross-section to a panel analysis would introduce substantial methodological challenges. In particular, we would have to address not just spatial, but also temporal (error-)dependence, which would introduce substantial additional complexity to the research design.

To test the proposed hypothesis, I analyze the number of politically relevant ethnic groups whose settlement regions overlap with a given areal unit. I use data from the year 2009, the last year covered by the EPR data set. I use data from 2009 because I assume that ethnic mobilization is cumulative: Once a group mobilizes and acts on the national political scene, it rarely disappears. Hence, I assume that the situation in 2009 is a reasonable summary of what happened up to that point in time.

\begin{figure}[H]
	\centering
		\includegraphics[scale=0.7]{Plots/epr.pdf}
	\caption{Politically relevant ethnic groups with territorially concentrated settlement areas in South-Sahara Africa and Asia in 2009, from GeoEPR \citep{Wucherpfennig2011}.}
	\label{Fig:epr}
\end{figure}


\paragraph{Treatment} \hspace{0pt} \\
I measure petroleum production in a given area using the PETRODATA data set by \cite{Lujala2007}. PETRODATA offers geo-coded data on the location of petroleum fields together with first production and discovery dates. An areal unit is coded as having been treated if it intersects with at least one oil or natural gas field that has been productive in the period between 1965 and 2007.\footnote{Petroleum fields identified as productive but without a first production date are assumed to have been producing in the studied time frame.}
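
As a compact restatement of this coding rule (the symbols are illustrative shorthand, not PETRODATA terminology), let $A_i$ denote the area covered by areal unit $i$, and let $\mathcal{F}$ denote the set of oil and gas field polygons treated as productive between 1965 and 2007, including productive fields without a recorded first production date. The treatment indicator can then be written as
\begin{equation*}
T_i =
\begin{cases}
1 & \text{if } A_i \cap f \neq \emptyset \text{ for at least one } f \in \mathcal{F},\\
0 & \text{otherwise.}
\end{cases}
\end{equation*}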

\paragraph{Controls} \hspace{0pt} \\
I use a number of spatially disaggregated control variables for the matching procedure and subsequent econometric analysis. 

I control for the number of inhabitants of a given area because petroleum production and ethnopolitics may simply be correlated because they both require human settlement. I calculate the population living in a given areal unit in 1990 using the GPW (version 3) data set by \cite{CIESIN2005}. 

Further, I control for various geographic variables which may affect the studied relationship. I calculate the great circle distances from a given areal unit to its host country's capital and to the nearest international border as measures for peripherality.\footnote{All information on country borders and capitals is taken from the CShapes data set by \cite{Weidmann2010}.}  I control for rugged terrain by calculating the standard deviation in elevation points from a 30 arc-second digital elevation model (GTOPO30, \citealt{USGS1996}) which intersect with the given areal unit. I also calculate the area covered by a given areal unit as an additional control. Although I initially create units of analysis that are of equal size, I crop them if they intersect with international borders or coastlines in order to ensure that they can be associated unambiguously with a single country.  Consequently, the actual area covered by any given unit may vary slightly.
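
The distance and ruggedness measures can be illustrated with formulas. This is only a sketch under assumptions: the text above does not specify from which point of a unit distances are measured, which distance formula is applied, or whether the population or the sample standard deviation is used. One standard way to compute the great circle distance between two points with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ (in radians) on a sphere of radius $r$ is the haversine formula,
\begin{equation*}
d = 2r \arcsin\left( \sqrt{ \sin^2\left( \frac{\varphi_2 - \varphi_1}{2} \right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left( \frac{\lambda_2 - \lambda_1}{2} \right) } \right).
\end{equation*}
Similarly, if $z_1, \dots, z_{n_i}$ are the elevation points intersecting unit $i$ and $\bar{z}_i$ is their mean, the population version of the ruggedness measure would be
\begin{equation*}
R_i = \sqrt{ \frac{1}{n_i} \sum_{p=1}^{n_i} \left( z_p - \bar{z}_i \right)^2 }.
\end{equation*}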

I also calculate a control variable measuring the number of linguistic groups in a given area based on the geo-coded World Language Mapping System (WLMS) data set (version 16.0, \citealt{GMI2013}). I include this control variable because I suspect that social heterogeneity is correlated with both the studied treatment and the outcome. On the one hand, it is straightforward to expect that additional social diversity in ascriptive traits like language, religion, or phenotype in a given region is associated with a larger number of potentially mobilizable ethnic groups. On the other hand, social diversity may also be associated with the probability of a given region featuring petroleum extraction. In particular, social heterogeneity may be a better proxy of peripherality than the aforementioned geographic controls. Diversity may, for instance, be the result of the failure of the central state to penetrate local societies \citep{Scott2009}. If social heterogeneity is in fact associated with weak state institutions, large-scale petroleum recovery operations will likely be more costly, and thus less frequent. Alternatively, \cite{michalopoulos2012origins} provides evidence that linguistic diversity is especially common in agriculturally fertile areas. Thus, social heterogeneity may be associated with favorable conditions for economic development, making petroleum extraction more, rather than less, likely. Optimally, I would control for these effects using a measure of social diversity that covers all common ethnic delimiters, i.e., language, religion, and phenotype. However, because such data is unavailable, I am only able to control for linguistic diversity. I further address this issue in the next section on case selection.

Finally, in addition to these spatially disaggregated variables, I also include country-level dummy variables in both the matching procedure and the subsequent econometric analysis. The reason is that there are numerous possible confounders on the country-level which are difficult to control for individually. In particular, factors like a country's overall level of economic development, quality of institutions, geographic extent and form, and colonial history may be associated with both its ethnopolitical landscape and the scope of large-scale industrial petroleum recovery operations. Especially with regard to the matching procedure one might imagine alternative strategies to eliminate the effects of analyzing hierarchically structured data, such as only matching within each parent unit. However, \cite{arpino2011specification} analyze the issue using Monte Carlo simulation and conclude that the dummy-approach chosen here is an efficient strategy. One consequence of matching on country-level dummies using propensity scores (as described below) is that countries without any productive petroleum fields will be excluded from the sample, since in petroleum-free countries, the dummy would perfectly predict the absence of the treatment. Though this is unfortunate, I contend that the added benefit of controlling for country-level confounders outweighs the costs of a smaller sample. 

\subsection{Case Selection}

In the current iteration of this paper, I restrict the empirical analysis to countries in South-Sahara Africa and Asia. The reasoning underlying this decision is twofold. First, although controlling for social heterogeneity is of key importance for the analysis, I can only measure linguistic diversity, as reported by the WLMS data set. I argue that controlling for linguistic diversity is an appropriate substitute in regions where ethnic identities form primarily along linguistic traits, which is the case in most countries of South-Sahara Africa and Asia. In contrast, salient ethnic identities in the Middle East focus very much on differences in religious denominations, which I cannot measure. Moreover, the WLMS-based linguistic diversity data is also inappropriate for measuring social heterogeneity in countries with a large European settler population, as is often the case in Latin America. The reason is that WLMS attempts to locate the geographical origin of modern languages, thus ignoring the distribution of European languages in ex-colonies.

Second, I focus on South-Sahara Africa and Asia because I expect that the hypothesized mechanisms are particularly likely to apply in relatively young ex-colonial states, which are located primarily in the selected regions. Ex-colonial states frequently feature political institutions that favor rent-seeking behavior and clientelism over other, issue-specific mobilization platforms. It is well known that colonial powers often left behind poorly crafted government institutions with a severe concentration of power at the executive level, and few independent constraints (see the discussion by \citet{Acemoglu2000} of extractive institutions, or Posner's (2007) characterization of African politics). This institutional setting provides strong incentives for the ``prize-grabbing'' type of politics where political coalitions form along horizontal cleavages with the goal of gaining access to government hand-outs and benefits. The key problem underlying this dynamic is the absence of effective checks and balances on executive power (Fearon 1999, citing \citealt[p. 166]{Limongi1997}), which would ensure that incumbent governments do not divert state resources exclusively to their constituency, but install policies that benefit broad sections of the population. In the absence of such barriers, incumbents face strong incentives to secure their hold on power by providing ``pork'' to their supporters, and paying little attention to public goods provision.  In this context, political mobilization will rely largely on attempts to acquire part of the pie for one's own group, rather than any issue-specific platforms. It is exactly this type of environment where we would expect the hypothesized mechanisms to apply.

A second reason why we would expect ex-colonial states to be of particular interest for the theory at hand is the pertinence of ethnicity-based political coalitions in these countries. Clientelistic elites practicing ``prize-grabbing'' politics to secure state-funded goods for their co-ethnic constituents has become the political modus operandi in many ex-colonial states \citep{Wimmer1997}, and particularly in South-Sahara Africa (\citealt{Lemerchand1972}, \citealt{Posner2005}). These ethnic rent-seeking practices are at least partially the legacy of European colonial rulers, who have often deliberately shaped ethnic identities and institutionalized ethnicity-based patronage systems to govern their dependencies (\citealt{Horowitz1985}, \citealt{Young1994}). Naturally, in countries where political payoffs are almost exclusively determined by membership in an ethnic group, we would expect individuals in resource-rich regions to be particularly likely to mobilize on the basis of ethnic identities, regardless of whether they do so to address externalities or to access resource rents. In contrast, in countries where political platforms based on issue-related and cross-cutting cleavages are well established, entering the political arena on the basis of local ethnic identities will be much more difficult. 

Finally, ex-colonial states may be particularly susceptible to the proposed mechanisms simply because their citizens have a comparatively short history of common statehood. A core tenet of the classic literature on nationalism by \citet{Deutsch1953}, \citet{Anderson2006}, and \citet{Gellner1983} is that state-building and nation-building go hand in hand, and that the continued existence of centralized governance institutions has played a significant role in shaping national identities in Europe \citep[p. 13]{Posner2004b}. Hence, in countries with a long history of statehood, we would generally expect stronger national identities and less social heterogeneity, which complicates political mobilization along sub-national ethnic delimiters. Conversely, in young states, national identities will be relatively weak, and local identities more readily available for political mobilization.

With the restriction to South-Sahara Africa and Asia in place, and in combination with the above-mentioned limitation to countries with productive petroleum fields, I end up with a set of 28 countries from which subnational areal units are sampled for analysis.

\subsection{Supervised Areal Unit Sampling}

I compile the data for testing the proposed hypothesis using a new spatial matching design which I refer to as \emph{supervised areal unit sampling}. The idea is to compare arbitrarily defined areal units having received the treatment (i.e., featuring petroleum production), with petroleum-free control units that are as similar as possible in all relevant aspects. 

A key element of the research design presented in this section is the use of arbitrarily defined areal units. By this I mean the use of units of analysis that are defined solely via their spatial extent, and do not refer to any politically or socially meaningful category. I rely on such units for two reasons. First, they allow spatially disaggregated analysis. This aspect is of importance because spatial proximity between the treatment and the outcome is a key element of the hypothesis under investigation. Any country-level analysis of the relationship between petroleum production and ethnic mobilization would constitute only a partial test of the proposed mechanisms.  Second, the employed units of analysis are clearly exogenous to the phenomenon under investigation. The use of ``natural'' units of analysis, such as administrative units, bears the risk of introducing endogeneity issues to the analysis. In particular, the location and spatial extent of administrative units may very well be associated with the presence of politically relevant ethnic groups, as well as large-scale petroleum extraction operations, thus impeding an unbiased analysis of the two variables. 

The use of arbitrarily defined areal units raises the question of which units to include in the sample to be analyzed. There are many possible strategies to sample spatial units from a given plane. For instance, an increasingly popular approach in quantitative conflict research is the use of grid cells covering the entire landmass to be studied (e.g., \citealt{Buhaug2006}, \citealt{Theisen2012}). 

I depart from this standard approach and propose a supervised sampling technique that is designed to find a balance between three objectives:
\begin{enumerate}
\item[i] \emph{Generate a representative sample.} By this I mean the objective that the analyzed data set should be large enough to allow generalizable inference across countries.
\item[ii] \emph{Generate a balanced sample.} Based on the literature on statistical matching, we generally wish for a sample where the treatment and control groups are similar in all relevant aspects, thus reducing model dependence when estimating the effect of the treatment \citep{ho2007matching}.
\item[iii] \emph{Minimize spatial dependence in the sample.} The use of spatially defined units of analysis will often entail spatial dependence, i.e., the outcome variable in proximate units will not be independently distributed. This may cause two issues for statistical analysis: First, most regression models only allow unbiased inference in the presence of conditional independence. If proximate units feature correlated errors, statistical procedures based on the $iid$ assumption will yield biased estimates. The second issue arises if the studied treatment exhibits diffusion effects, i.e., if its presence in a treated unit affects the outcome of nearby control units. In the language of the Neyman-Rubin causal model, diffusion implies a violation of the Stable Unit Treatment Value Assumption (SUTVA, see \citealt{sekhon2008neyman} for a discussion), which entails biased estimates for the treatment's effect on the outcome. To eliminate these issues, we generally wish to create a data set where spatial dependence is minimal, or at least tractable with appropriate methodology.
\end{enumerate}
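
The two problems named under objective iii can be stated more formally. The notation below is illustrative and is not taken from the cited sources. Let $\varepsilon_i$ denote the error term of unit $i$ in a regression model, and let $Y_i(t_1, \dots, t_n)$ denote the potential outcome of unit $i$ when the $n$ units receive treatments $t_1, \dots, t_n$. Independence of the errors implies the first condition below, and the no-interference component of SUTVA corresponds to the second:
\begin{align*}
\text{uncorrelated errors:} &\quad \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for all } i \neq j,\\
\text{no interference:} &\quad Y_i(t_1, \dots, t_n) = Y_i(t_i) \quad \text{for all } i.
\end{align*}
Spatial dependence violates the first condition when nearby units $i$ and $j$ have correlated errors, and diffusion violates the second when the outcome of a control unit depends on the treatment status of a neighboring unit.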

Obviously, there are inherent trade-offs when pursuing these objectives. Generally, a larger (more ``representative'') sample will lead to less balance, and will increase spatial dependence between observations. In the extreme case of the most representative sample possible, with small units covering the entire plane, spatial dependence will be so large that there is hardly any independent variance among neighboring observations, regardless of the outcome under scrutiny. On the other end of the spectrum, one could sample a data set including only very few, perfectly balanced, and far apart treatment-control pairs, which would largely eliminate model- and spatial dependence, but impede generalizable inference.

To find an acceptable trade-off between these objectives, I propose a spatial matching procedure which operates as follows:

\begin{enumerate}
\item Cover the studied plane with a large number of randomly placed circular candidate units of fixed size. Candidate units may overlap.
\item Map all covariates onto all candidate units and identify treated and control units, i.e., whether units overlap with productive petroleum fields.
\item Remove all candidate control units within a pre-specified distance $D$ of the treated units. This is to eliminate control units that are potentially affected by nearby treatments via diffusion, thus reducing the likelihood of SUTVA violations in the data.
\item Estimate a propensity score for each unit, i.e., the probability of the unit being treated given the controls. 
\item For each treated unit: Identify the $k$ control units with the closest propensity scores (i.e., with the smallest absolute propensity score distances). Optionally, a caliper $c$ can be enforced; that is, only candidate control units with a propensity score difference within the defined caliper are matched.
\item Identify the treated unit with the best matches, i.e., the treated unit where the average absolute propensity score difference of the candidate matches is smallest. Add the selected treated unit and its matched control units to the list of selected observations.
\item Remove the selected treated and control units from the list of candidate units. 
\item Remove all units overlapping with the selected treated and control units from the list of candidate units.
\item Repeat steps 5--8 until there are no more treatment or control units in the list of candidate units.
\end{enumerate}
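
Steps 4 to 6 of this procedure can be summarized in formulas. The notation is illustrative only, and the model used to estimate the propensity score is not specified at this point (a logit model would be one common choice, but that is an assumption). Let $T_i \in \{0,1\}$ be the treatment indicator of unit $i$, $X_i$ its covariates, $\mathcal{C}$ the current set of candidate control units, $k$ the number of matches per treated unit, and $c$ the optional caliper. Then
\begin{align*}
\hat{e}_i &= \widehat{\Pr}(T_i = 1 \mid X_i),\\
\mathcal{C}_i &= \{\, j \in \mathcal{C} : |\hat{e}_i - \hat{e}_j| \leq c \,\} \quad (\mathcal{C}_i = \mathcal{C} \text{ if no caliper is enforced}),\\
M_i &= \operatorname*{arg\,min}_{M \subseteq \mathcal{C}_i,\ |M| = k} \ \sum_{j \in M} |\hat{e}_i - \hat{e}_j|,\\
i^{*} &= \operatorname*{arg\,min}_{i \,:\, T_i = 1} \ \frac{1}{|M_i|} \sum_{j \in M_i} |\hat{e}_i - \hat{e}_j|.
\end{align*}
Unit $i^{*}$ and its matches $M_{i^{*}}$ enter the sample, they and all candidates overlapping with them are removed, and the match sets are recomputed for the remaining treated units. The description above leaves open how treated units with fewer than $k$ admissible controls are handled, so the formulas do not cover that case.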

This procedure follows the logic of $k$-nearest neighbor propensity score matching \citep{rosenbaum1985constructing}, but with two major modifications. First, unlike the standard propensity score matching algorithm, the one proposed here optimizes balance under the side condition that no two units in the matched sample overlap. Second, the introduced algorithm is less greedy than the one commonly used in propensity score matching. In the standard approach, the treated units are shuffled and then matched to their closest neighbors in random order, which may lead to globally inefficient outcomes. Here, we progress through the list of candidate treatments by selecting the best matched set of all possible matches in the current iteration. This approach is still not globally optimal, but it removes some of the stochastic inefficiencies caused by the fact that each match affects the list of candidate units disproportionally via the deletion of overlapping units.

The key benefit of this procedure is that it allows optimizing the trade-off between the stated objectives with a small number of parameters. Enforcing stricter matching parameters (smaller $k$ and $c$) and setting a greater minimal distance between control and treatment units (larger $D$), will yield a smaller, more balanced sample. Moreover, the issue of spatial dependence will be attenuated because the areal units are more distant from each other, and smaller sample sizes allow the application of spatial econometric methods to model residual spatial dependence parametrically. On the other hand, relaxing the restrictions on the matching and minimal distance parameters will yield a data set that resembles the grid-cell approach. In the extreme case where we enforce no balance requirements and do not require a minimal distance between treatment and control units, we will end up with a data set of non-overlapping circular units covering almost the entire plane.
 
I argue that the proposed geographic matching approach is superior to its alternatives because it allows managing the trade-off between the three stated objectives more productively. Specifically, using arbitrarily defined grid-cells covering the entire plane to be analyzed sacrifices the balance and spatial dependence objectives for the sake of a fully inclusive sample. Applying spatial econometric models to accommodate spatial dependence is often impossible in grid-cell designs because the number of observations is so large that estimating spatial models becomes computationally infeasible (see, e.g., \citealt{Buhaug2011}). I would argue that it is often preferable to analyze only a subset of the available data if doing so entails less model dependence and allows estimating models that accommodate spatial interdependence. As an alternative to the grid-cell approach, one might also consider reducing sample size by drawing a completely random subset of all possible spatial units. This approach is pursued and advocated by \cite{Buhaug2011}. However, while the matching design proposed above is equally effective in reducing sample size, it features the added benefit of producing a balanced sample.

\subsection{Econometric Analysis}
\label{sec:econ}

Due to the complex nature of the data being analyzed, simple mean comparisons between the sampled treatment and control groups are unlikely to yield unbiased estimates of the treatment's effect on the outcome.\footnote{In the language of the Neyman-Rubin causal model, we are, in principle, interested in estimating the average treatment effect on the treated (ATT). However, since I analyze the matched data with parametric statistical models which are not embedded in the N-R causal model, I will refrain from interpreting any estimated effects as ATT estimates. Rather, in accordance with \cite{ho2007matching}, I treat the matching procedure as a data preprocessing step with the goal of reducing model dependence in subsequent parametric analysis.} In particular, it is likely that the areal units sampled via the matching procedure will still exhibit spatial interdependence. Although the minimal distance requirement between treatment and control units used in the matching procedure will likely attenuate the bias due to SUTVA violations, the matched sample may still feature areal units that are in close proximity to each other. Consequently, the $iid$ assumption may still be violated, which may lead to biased estimates without proper adjustments.

I pursue two strategies to arrive at credible estimates despite potential spatial dependence. First, I employ spatial error models to estimate the effect of local petroleum production on the number of politically relevant ethnic groups. Spatial error models are linear regression models where the assumption of independently distributed errors is relaxed. Instead, it is assumed that the (normally distributed) errors follow a spatial autoregressive process, parameterized via a row-standardized spatial weights matrix $W$ and a spatial dependence parameter $\lambda$ \citep[ch. 3]{Ward2008}. Formally, we assume
\begin{align*}
y & = X\beta + u \\
u & = \lambda W u + \epsilon \\
\epsilon & \sim N(0, \sigma^2 I).
\end{align*}
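Provided that $I - \lambda W$ is invertible, which holds for $|\lambda| < 1$ when $W$ is row-standardized, the error process can be solved for $u$, which makes the implied covariance structure of the response explicit:
\begin{align*}
u & = (I - \lambda W)^{-1} \epsilon \\
y & \sim N\left(X\beta, \; \sigma^2 \left[(I - \lambda W)^{\top}(I - \lambda W)\right]^{-1}\right).
\end{align*}
For $\lambda = 0$, the covariance matrix collapses to $\sigma^2 I$, i.e., the standard linear model with independent errors.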
To model the assumption that the errors of nearby units are non-independent, I populate the (unstandardized) spatial weights matrix with inverse geodesic distances between the centroids of the sampled units. When deriving the results reported in the next section, I experimented with several plausible implementations of this design. In particular, I tested a fully connected inverse-distance matrix, a block-diagonal spatial weights matrix which only allows non-zero spatial dependence for units within the same country, and an even more restrictive version where non-zero dependence is only allowed for the nearest three units within the same country. All results reported in the next section are based on the last of these, the nearest-3 block-diagonal weights matrix, as it consistently yielded the lowest AIC scores.
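
In symbols, the nearest-3 block-diagonal specification can be sketched as follows; the notation is introduced here for illustration, and the handling of ties and of units with fewer than three same-country neighbors is left open. Let $d_{ij}$ be the geodesic distance between the centroids of units $i$ and $j$, and let $N_3(i)$ denote the three units nearest to $i$ among those located in the same country as $i$. The unstandardized weights $\tilde{w}_{ij}$ and their row-standardized counterparts $w_{ij}$ are then
\begin{align*}
\tilde{w}_{ij} & = \begin{cases} d_{ij}^{-1} & \text{if } j \in N_3(i), \\ 0 & \text{otherwise,} \end{cases} \\
w_{ij} & = \frac{\tilde{w}_{ij}}{\sum_{l} \tilde{w}_{il}},
\end{align*}
so that every row of $W$ with at least one non-zero entry sums to one. The fully connected variant corresponds to setting $\tilde{w}_{ij} = d_{ij}^{-1}$ for all $i \neq j$, and the plain block-diagonal variant to setting $\tilde{w}_{ij} = d_{ij}^{-1}$ for all pairs $i \neq j$ within the same country.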

Second, as a non-parametric alternative to the spatial error models, I account for non-independent errors using a cluster-wise bootstrap procedure. In a first step, point estimates are derived by fitting a standard generalized linear model to the entire matched sample, regardless of potential $iid$ violations. Then, standard errors are calculated on the basis of a non-parametric bootstrap procedure where we resample with replacement at the country level. This method yields unbiased standard error estimates under the assumption that the areal units have dependent errors within countries, but not across countries \citep[p. 601]{Fox2008}.\footnote{Note that this assumption receives support from the spatial error model implementations, where a country-wise block-diagonal weights matrix outperformed the fully connected weights matrix in terms of model fit.} The cluster-wise bootstrap is similar to the calculation of robust clustered standard errors (RCSE) based on M-estimation theory \citep{Rogers1993}. However, RCSEs are only asymptotically efficient in the number of clusters available, which is only 28 in the present application. In small sample settings, bootstrap procedures generally outperform asymptotic results in terms of efficiency \citep[ch. 21]{Fox2008}. The key advantage of the cluster-wise bootstrap over the spatial error model is that it does not require us to make assumptions about the type of dependence observed within countries via specifying a spatial weights matrix. On the other hand, the cluster-bootstrap estimates are of course less efficient than those of a correctly specified parametric model.
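
The bootstrap standard errors can be written out as follows. This is a sketch in which the number of replications $B$ and the use of the standard deviation of the replicates (rather than, e.g., percentile intervals) are assumptions, as neither is fixed by the description above. For $b = 1, \dots, B$, draw 28 countries with replacement, stack all areal units belonging to the drawn countries, and refit the generalized linear model to obtain the replicate estimate $\hat{\beta}^{*}_{b}$. Then
\begin{align*}
\bar{\beta}^{*} & = \frac{1}{B} \sum_{b=1}^{B} \hat{\beta}^{*}_{b} \\
\widehat{\mathrm{se}}(\hat{\beta}) & = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\beta}^{*}_{b} - \bar{\beta}^{*}\right)^2}.
\end{align*}
Because entire countries are resampled, any within-country error dependence is carried over intact into every replicate, without having to be specified.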

\section{Results}
\label{Sec:5}
Testing the hypothesis using the proposed sampling method necessitates making a number of (partially) arbitrary modeling decisions. In particular, the extent of the candidate areas used as a basis for the matching algorithm needs to be determined ex-ante, and may be important for whether we are able to observe the postulated effect. If we base the analysis on units that are too large, we might not observe the hypothesized effect because the outcome variable will pick up too much ``noise'' that is unrelated to petroleum extraction. On the other hand, if the studied units are too small, we might overlook the postulated effect because the outcome variable does not feature sufficient variance. Because we have little theoretical guidance when deciding on a particular unit size, I let the data speak and rerun the entire analysis using different specifications. Doing so also helps mitigate the modifiable areal unit problem, i.e., the risk that reported evidence is hand-picked from a set of models estimated at different geographical scales such that the postulated hypothesis appears unequivocally supported by the data \citep{Openshaw1983}.

I first discuss the results obtained from constructing circular candidate units with a 100 kilometer radius, i.e., covering approximately 31'400 square kilometers.\footnote{This is about the size of Belgium.} For creating the matched sample, I first generated about 15'000 randomly placed candidate units covering the entire landmass of the 28 countries in the study region with productive petroleum fields. Next, I identified treatment and control units, removed all controls within a 200 kilometer radius of any treated region, and estimated propensity scores, i.e., each unit's probability of being treated given the spatial controls and the country dummies. The propensity scores were estimated using a semi-parametric generalized additive model with penalized smoothing splines for each spatial covariate (see, e.g., \citealt{Keele2008}). I then applied the spatial matching algorithm, matching exactly one control unit to every (non-overlapping) treated unit, and enforcing a caliper of $5 \times 10^{-4}$. The caliper was set in order to exclude a small number of very poorly matched pairs which decreased balance in the sample substantially. However, removing the caliper entirely has no major effect on the results.
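
Schematically, and in notation introduced here for illustration, the propensity score model for unit $i$ with treatment indicator $T_i$, spatial covariates $x_{i1}, \dots, x_{iJ}$, and country $c(i)$ takes the form
\begin{align*}
\operatorname{logit} \Pr(T_i = 1) & = \alpha_{c(i)} + \sum_{j=1}^{J} f_j(x_{ij}),
\end{align*}
where the $f_j$ are penalized smooth functions, the $\alpha_{c(i)}$ are country-specific intercepts, and the logit link is an assumption. Assuming further that the caliper is applied to the absolute difference in estimated propensity scores $\hat{p}$, a treated unit $i$ and a control unit $j$ are admissible as a matched pair only if
\begin{align*}
\left| \hat{p}_i - \hat{p}_j \right| & \leq 5 \times 10^{-4}.
\end{align*}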

This procedure yields a sample of 242 matched observations. Figure \ref{Fig:100ps} shows a quantile-quantile plot of the propensity scores in the matched sample and the original sample of candidate units. Similarly, Figure \ref{Fig:100cb} shows quantile-quantile plots for all spatial covariates. Quite clearly, balance was improved substantially through the matching procedure, as also evidenced by the Kolmogorov-Smirnov statistics reported in the upper-left and lower-right corners of the plots. Finally, Figure \ref{Fig:100map} maps the matched sample and identifies treatment and control units.

\begin{figure}[H]
	\centering
		\includegraphics[scale=0.6]{Plots/100km_PSQQ.pdf}
	\caption{Quantile-quantile plot of propensity scores for original and matched samples using 100km areal units. Red dots refer to the matched sample, whereas black dots refer to the original candidate units.}
	\label{Fig:100ps}
\end{figure}

\begin{figure}[H]
	\centering
		\includegraphics[scale=0.75]{Plots/100km_COVQQ.pdf}
	\caption{Quantile-quantile plots of spatial covariates for original and matched samples using 100km areal units. Red dots refer to the matched sample, whereas black dots refer to the original candidate units. All variables scaled to the $[0, 100]$ interval.}
	\label{Fig:100cb}
\end{figure}

\begin{figure}[H]
	\centering
		\includegraphics[scale=0.5]{Plots/100km_matchmap.pdf}
	\caption{The matched sample of 100km radius areal units.}
	\label{Fig:100map}
\end{figure}

The first two columns of table \ref{tab1} show the estimates obtained by fitting a generalized linear model with a Quasi-Poisson stochastic component to the matched sample. The Quasi-Poisson GLM is a modification of the Poisson model for counts that allows for overdispersion in the outcome \citep[p. 391]{Fox2008}. Quite clearly, the estimate for the petroleum dummy is positive and significant at conventional significance levels regardless of whether we include the controls in the model. As discussed in the previous section, however, these estimates might be misleading because the $iid$ assumption underlying the generalized linear model is probably violated due to spatial error dependence. 
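
For reference, the Quasi-Poisson specification with a log link (the canonical choice, assumed here) relates the expected group count $\mu_i$ of unit $i$ to its covariate vector $x_i$ via
\begin{align*}
\ln \mu_i & = x_i^{\top}\beta \\
\mathrm{Var}(y_i) & = \phi \mu_i,
\end{align*}
where $\phi$ is the dispersion parameter and $\phi > 1$ indicates overdispersion. Under this link, coefficients are interpreted multiplicatively. The petroleum estimates in columns 1 and 2 thus translate into
\begin{align*}
\exp(0.240) & \approx 1.27 \\
\exp(0.190) & \approx 1.21,
\end{align*}
i.e., an expected number of politically relevant groups that is roughly 27\% and 21\% higher, respectively, in petroleum-producing units.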

\begin{sidewaystable}
\footnotesize
\centering
  \caption{Results based on matched 100km radius areal units.} 
  \label{tab1} 
\begin{tabular}{@{\extracolsep{5pt}}lcccccc} 
\\[-1.8ex]\hline 
\hline \\[-1.8ex] 
Model & \multicolumn{2}{c}{GLM Quasi Poisson} & \multicolumn{2}{c}{SEM} & \multicolumn{2}{c}{GLM Quasi Poisson \& Bootstrap} \\ 
Response & \multicolumn{2}{c}{group count} & \multicolumn{2}{c}{ln(group count + 1)} & \multicolumn{2}{c}{group count} \\ 
\\[-1.8ex] & (1) & (2) & (3) & (4) & (5) & (6)\\ 
\hline \\[-1.8ex] 
 petroleum & 0.240$^{**}$ & 0.190$^{***}$ & 0.184$^{***}$ & 0.138$^{***}$ & 0.240$^{*}$ & 0.190$^{**}$ \\ 
  & (0.102) & (0.072) & (0.049) & (0.043) & (0.142) & (0.091) \\ 
  & & & & & & \\ 
 ln(area) &  & 0.242 &  & 0.097 &  & 0.242 \\ 
  &  & (0.151) &  & (0.067) &  & (0.276) \\ 
  & & & & & & \\ 
 ln(pop) &  & 0.022 &  & 0.030$^{*}$ &  & 0.022 \\ 
  &  & (0.023) &  & (0.017) &  & (0.041) \\ 
  & & & & & & \\ 
 ln(capital dist) &  & 0.010 &  & $-$0.008 &  & 0.010 \\ 
  &  & (0.034) &  & (0.018) &  & (0.035) \\ 
  & & & & & & \\ 
 ln(border dist) &  & 0.019 &  & 0.023 &  & 0.019 \\ 
  &  & (0.039) &  & (0.021) &  & (0.043) \\ 
  & & & & & & \\ 
 elevation &  & 0.0003$^{**}$ &  & 0.0002$^{***}$ &  & 0.0003 \\ 
  &  & (0.0001) &  & (0.0001) &  & (0.0002) \\ 
  & & & & & & \\ 
 ln(language) &  & 0.423$^{***}$ &  & 0.195$^{***}$ &  & 0.423$^{**}$ \\ 
  &  & (0.061) &  & (0.038) &  & (0.191) \\ 
  & & & & & & \\ 
\hline \\[-1.8ex] 
Country-Dummies & No & Yes & No & Yes & No & Yes \\
Observations & 242 & 242 & 242 & 242 & 242 & 242 \\ 
Log Likelihood &  &  & $-$109.962 & $-$50.563 &  &  \\ 
AIC &  &  & 227.923 & 171.126 &  &  \\ 
$\hat{\lambda}$ &  &  &  0.688 & 0.482 &  &  \\ 
LR Test (df = 1) &  &  & 168.614$^{***}$ & 49.681$^{***}$ &  &  \\ 
\hline 
\hline \\[-1.8ex] 
\textit{Notes:}  & \multicolumn{6}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\ 
		     & \multicolumn{6}{r}{Estimates for intercepts and country-dummies not reported.} \\ 
		     & \multicolumn{6}{r}{Standard errors in parentheses.} \\ 
\normalsize 
\end{tabular} 
\end{sidewaystable}

Columns 3 and 4 report the estimates obtained by fitting spatial error models with nearest-3 block-diagonal spatial weights matrices (see section \ref{sec:econ}) to the log-transformed response.\footnote{The log-transform was used to generate a response with less positive skew. The resulting transformed outcome variable conforms more closely to the normality assumption underlying the spatial error model.} As expected, the estimated spatial dependence parameter ($\hat{\lambda}$) is positive, and a likelihood-ratio test against the $iid$ alternative clearly rejects the null of no spatial dependence. However, the estimates associated with the petroleum dummy are still clearly positive and significant.
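
The reported test statistics can be related to the log likelihoods as follows, assuming the conventional definition of the likelihood-ratio statistic. With $\ell_{\mathrm{SEM}}$ the maximized log likelihood of the spatial error model and $\ell_{0}$ that of the restricted model with $\lambda = 0$,
\begin{align*}
LR & = 2\left(\ell_{\mathrm{SEM}} - \ell_{0}\right) \; \overset{a}{\sim} \; \chi^2_1 \quad \text{under } H_0: \lambda = 0.
\end{align*}
The statistics of 168.614 and 49.681 reported in table \ref{tab1} far exceed 6.635, the 1\% critical value of the $\chi^2_1$ distribution.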

Finally, columns 5 and 6 show the estimates obtained by fitting the Quasi-Poisson GLM to the data, but estimating the standard errors using the cluster-wise bootstrap procedure. In line with expectations, the bootstrapped standard errors are notably larger than those obtained under the (erroneous) $iid$ assumption, but we can still reject the null that petroleum has no effect on the number of politically relevant ethnic groups in an area at the 10\% and 5\% levels, respectively. 
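
As a rough check, assuming a normal approximation for the bootstrap distribution (the significance levels may equally have been derived from bootstrap percentiles), the ratios of the petroleum estimates to their bootstrapped standard errors in columns 5 and 6 are
\begin{align*}
z_{(5)} & = \frac{0.240}{0.142} \approx 1.69 \\
z_{(6)} & = \frac{0.190}{0.091} \approx 2.09,
\end{align*}
which lie above the two-sided critical values of 1.645 (10\% level) and 1.960 (5\% level), respectively.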

In summary, given the chosen modeling parameters, the evidence does suggest that petroleum producing areas generally feature a larger number of politically relevant and territorially concentrated ethnic groups than comparable, petroleum-free regions. I repeated the exact same analysis (with equivalent parameters for the matching procedure) on circular units with radii of 20km and 50km. The detailed results can be found in the appendix. Interestingly, while balance and sample sizes increased notably using these smaller units, the substantive results remain unchanged.

Despite these remarkably robust results, caution is advised. A number of additional methodological issues still need to be addressed in future iterations of this paper. First, as illustrated by Figure \ref{Fig:100map}, PETRODATA is fairly liberal in coding productive petroleum fields, and likely overstates the number of actual petroleum producing regions in the studied countries. For this reason, the analyses should be replicated with a more restrictive coding of petroleum fields. Second, some countries dominate the sample simply due to their areal extent (namely India and China). The analysis should be repeated using a leave-one-out cross-validation strategy to evaluate whether the results are being driven by a single case.


\section{Conclusion}
\label{Sec:7}

In this paper, I provide evidence for the hypothesis that petroleum extraction promotes ethnic mobilization. Specifically, we observe that petroleum-producing regions in Sub-Saharan African and Asian countries systematically feature a larger number of politically relevant, territorially concentrated ethnic groups than comparable petroleum-free regions. Notably, this result holds even if we control for ethnolinguistic diversity, suggesting that individuals in petroleum-rich regions are indeed more likely to make use of pre-existing linguistic cleavages for political mobilization, rather than petroleum production being associated with social diversity for some other reason.
Although the quantitative evidence does not allow identifying the exact mechanism underlying this result, there are plausible candidate explanations. In particular, I argue that it is likely that both incentives for rent-seeking and ethnic patronage and local resistance against the environmental and social externalities of industrial petroleum production contribute to explaining the phenomenon.

Though still tentative, the result that petroleum production appears to promote ethnic mobilization has a number of interesting implications. First, the fact that we can trace back the presence of politically relevant ethnic identities to petroleum production speaks in favor of the instrumentalist assumption that ethnic salience is the outcome of relatively short-term, individual-level means-ends reasoning. Second, the finding that petroleum appears to affect the ethnopolitical constellation of a country may help deepen our understanding of the various ``resource curses'' identified in previous research. In fact, it is not implausible that the often reported adverse effects of petroleum production on economic growth (e.g. \citealt{Sachs2001}), democratization (e.g. \citealt{Ross2001}, \citealt{Smith2004}, \citealt{Jensen2004}), and violent intrastate conflict (e.g. \citealt{Humphreys2005}, \citealt{Ross2006}, \citealt{Lujala2010}) operate at least partially via ethnic mobilization. To what degree and how this is the case should be addressed in further research.

Beyond these substantive findings, the present paper has also introduced a new spatial matching procedure, dubbed \emph{spatial areal unit sampling}, with the explicit goal of finding a trade-off between sample size, balance, and spatial dependence in research designs based on arbitrary spatial units of analysis. Though a rigorous theoretical analysis of the proposed method's performance is still pending, it has been applied successfully in the present paper for creating a balanced subset of arbitrary areal units. 

Finally, there are a number of issues that still need to be addressed in future iterations of this paper. First, for reasons discussed in the previous section, the reported results are still tentative, and a number of additional robustness tests need to be conducted before drawing any definite conclusions. Second, although I believe that the limited geographic scope of the present study is well justified, it would still be interesting to see whether the argument applies to other regions. In particular, Latin America has seen numerous ``indigenous'' protest movements as a reaction to large-scale mining projects, and it is plausible that similar mechanisms are at work there (see, e.g., \citealt{bebbington2008mining}). Relatedly, it should be relatively easy to replicate the analysis with other (mineral) resources; in fact, many of the proposed arguments apply almost unconditionally to any high-value mineral resource requiring industrial extraction methods.


\newpage
\singlespacing
\bibliographystyle{chicago}
\bibliography{eprembib}

\newpage
\section{Appendix: Results for Alternative Areal Unit Sizes}

\begin{figure}[H]
	\centering
		\includegraphics[scale=0.7]{Plots/50km_COVQQ.pdf}
	\caption{Quantile-quantile plots of spatial covariates for original and matched samples using 50km areal units. Red dots refer to the matched sample, whereas black dots refer to the original candidate units. All variables scaled to the $[0, 100]$ interval.}
	\label{Fig:50cb}
\end{figure}

\begin{sidewaystable}
\footnotesize
\centering
  \caption{Results based on matched 50km radius areal units.} 
  \label{tab2} 
\begin{tabular}{@{\extracolsep{5pt}}lcccccc} 
\\[-1.8ex]\hline 
\hline \\[-1.8ex] 
Model & \multicolumn{2}{c}{GLM Quasi Poisson} & \multicolumn{2}{c}{SEM} & \multicolumn{2}{c}{GLM Quasi Poisson \& Bootstrap} \\ 
Response & \multicolumn{2}{c}{group count} & \multicolumn{2}{c}{ln(group count + 1)} & \multicolumn{2}{c}{group count} \\ 
\\[-1.8ex] & (1) & (2) & (3) & (4) & (5) & (6)\\ 
\hline \\[-1.8ex] 
 petroleum & 0.218$^{***}$ & 0.157$^{***}$ & 0.119$^{***}$ & 0.090$^{**}$ & 0.218$^{*}$ & 0.157$^{**}$ \\ 
  & (0.077) & (0.056) & (0.044) & (0.038) & (0.114) & (0.077) \\ 
  & & & & & & \\ 
 ln(area) &  & 0.117 &  & 0.048 &  & 0.117 \\ 
  &  & (0.130) &  & (0.057) &  & (0.164) \\ 
  & & & & & & \\ 
 ln(pop) &  & 0.010 &  & 0.021$^{*}$ &  & 0.010 \\ 
  &  & (0.018) &  & (0.012) &  & (0.039) \\ 
  & & & & & & \\ 
 ln(capital dist) &  & 0.006 &  & $-$0.0001 &  & 0.006 \\ 
  &  & (0.029) &  & (0.018) &  & (0.044) \\ 
  & & & & & & \\ 
 ln(border dist) &  & 0.033 &  & 0.007 &  & 0.033 \\ 
  &  & (0.028) &  & (0.015) &  & (0.034) \\ 
  & & & & & & \\ 
 elevation &  & 0.0003$^{**}$ &  & 0.0001 &  & 0.0003 \\ 
  &  & (0.0001) &  & (0.0001) &  & (0.001) \\ 
  & & & & & & \\ 
 ln(language) &  & 0.401$^{***}$ &  & 0.171$^{***}$ &  & 0.401$^{***}$ \\ 
  &  & (0.061) &  & (0.038) &  & (0.151) \\ 
  & & & & & & \\ 
\hline \\[-1.8ex] 
Country-Dummies & No & Yes & No & Yes & No & Yes \\
Observations & 388 & 388 & 388 & 388 & 388 & 388 \\ 
Log Likelihood &  &  & $-$138.674 & $-$74.910 &  &  \\ 
AIC &  &  & 285.349 & 221.820 &  &  \\ 
$\hat{\lambda}$ &  &  &  0.664 & 0.401 &  &  \\ 
LR Test (df = 1) &  &  & 249.359$^{***}$ & 52.074$^{***}$ &  &  \\ 
\hline 
\hline \\[-1.8ex] 
\textit{Notes:}  & \multicolumn{6}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\ 
		     & \multicolumn{6}{r}{Estimates for intercepts and country-dummies not reported.} \\ 
		     & \multicolumn{6}{r}{Standard errors in parentheses.} \\ 
\normalsize 
\end{tabular} 
\end{sidewaystable}

\begin{figure}[H]
	\centering
		\includegraphics[scale=0.75]{Plots/20km_COVQQ.pdf}
	\caption{Quantile-quantile plots of spatial covariates for original and matched samples using 20km areal units. Red dots refer to the matched sample, whereas black dots refer to the original candidate units. All variables scaled to the $[0, 100]$ interval.}
	\label{Fig:20cb}
\end{figure}

\begin{sidewaystable}
\footnotesize
\centering
  \caption{Results based on matched 20km radius areal units.} 
  \label{tab3} 
\begin{tabular}{@{\extracolsep{5pt}}lcccccc} 
\\[-1.8ex]\hline 
\hline \\[-1.8ex] 
Model & \multicolumn{2}{c}{GLM Quasi Poisson} & \multicolumn{2}{c}{SEM} & \multicolumn{2}{c}{GLM Quasi Poisson \& Bootstrap} \\ 
Response & \multicolumn{2}{c}{group count} & \multicolumn{2}{c}{ln(group count + 1)} & \multicolumn{2}{c}{group count} \\ 
\\[-1.8ex] & (1) & (2) & (3) & (4) & (5) & (6)\\ 
\hline \\[-1.8ex] 
 petroleum & 0.135$^{***}$ & 0.141$^{***}$ & 0.077$^{**}$ & 0.072$^{***}$ & 0.135 & 0.141$^{**}$ \\ 
  & (0.049) & (0.033) & (0.037) & (0.027) & (0.100) & (0.070) \\ 
  & & & & & & \\ 
 ln(area) &  & 0.228$^{**}$ &  & 0.014 &  & 0.228 \\ 
  &  & (0.113) &  & (0.041) &  & (0.200) \\ 
  & & & & & & \\ 
 ln(pop) &  & 0.041$^{***}$ &  & 0.029$^{***}$ &  & 0.041$^{**}$ \\ 
  &  & (0.010) &  & (0.007) &  & (0.019) \\ 
  & & & & & & \\ 
 ln(capital dist) &  & 0.044$^{**}$ &  & 0.015 &  & 0.044 \\ 
  &  & (0.021) &  & (0.014) &  & (0.035) \\ 
  & & & & & & \\ 
 ln(border dist) &  & 0.023 &  & 0.007 &  & 0.023 \\ 
  &  & (0.016) &  & (0.009) &  & (0.026) \\ 
  & & & & & & \\ 
 elevation &  & 0.0001 &  & $-$0.0001 &  & 0.0001 \\ 
  &  & (0.0001) &  & (0.0001) &  & (0.001) \\ 
  & & & & & & \\ 
 ln(language) &  & 0.358$^{***}$ &  & 0.130$^{***}$ &  & 0.358$^{***}$ \\ 
  &  & (0.051) &  & (0.027) &  & (0.116) \\ 
  & & & & & & \\ 
\hline \\[-1.8ex] 
Country-Dummies & No & Yes & No & Yes & No & Yes \\
Observations & 872 & 872 & 872 & 872 & 872 & 872 \\ 
Log Likelihood &  &  & $-$123.683 & 0.607 &  &  \\ 
AIC &  &  & 255.367 & 70.786 &  &  \\ 
$\hat{\lambda}$ &  &  &  0.715 & 0.484 &  &  \\ 
LR Test (df = 1) &  &  & 721.007$^{***}$ & 216.787$^{***}$ &  &  \\ 
\hline 
\hline \\[-1.8ex] 
\textit{Notes:}  & \multicolumn{6}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\ 
		     & \multicolumn{6}{r}{Estimates for intercepts and country-dummies not reported.} \\ 
		     & \multicolumn{6}{r}{Standard errors in parentheses.} \\ 
\normalsize 
\end{tabular} 
\end{sidewaystable}

\end{document}